Today we will…
You will be completing a final project in Stat 331/531 in teams of four. More details to come soon!
Things to be aware of in upcoming submissions:
Don’t include pages of output
Use the code chunk option
#| output: false
#| results: false
Data Description Components
Make sure to include context when describing the data set as well as the data characteristics.
mutate() vs summarise()
Read                 Think
Average              summarize(avg_var = mean())
Total                summarize(total = sum())
Which or For Each    group_by()
Minimum              slice_min()
Maximum              slice_max()
Minimum and Maximum  arrange() |> slice(1, n())
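As a rough sketch of how the "Read" phrases map onto the "Think" verbs, the example below uses a small made-up data set. The `prices` tibble and its column names are assumptions invented for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for illustration
prices <- tibble(
  region = c("North", "North", "South", "South", "South"),
  price  = c(1.20, 1.50, 0.90, 1.10, 1.30)
)

# "Average / total ... for each region" -> group_by() + summarize()
region_summary <- prices |>
  group_by(region) |>
  summarize(avg_price   = mean(price),
            total_price = sum(price))

# "Which region has the maximum average?" -> slice_max()
top_region <- region_summary |>
  slice_max(avg_price, n = 1)

# "Minimum and maximum" -> sort, then keep the first and last rows
extremes <- prices |>
  arrange(price) |>
  slice(1, n())
```

Note that `summarize()` collapses each group to one row, while `mutate()` would keep every original row and add a column.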
Bar plots are typically reserved for displaying frequencies
# A tibble: 4 × 3
geography mean_price_diff sd_price_diff
<fct> <dbl> <dbl>
1 San Francisco 0.719 0.334
2 San Diego 0.685 0.211
3 Sacramento 0.578 0.270
4 Los Angeles 0.528 0.188
Read more about Cleveland Dot Plots
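diff_summary |>
  # Order the geographies by mean difference; arrange() alone does not change the axis order of a factor
  mutate(geography = fct_reorder(geography, mean_price_diff)) |>
  ggplot(aes(x = mean_price_diff,
             y = geography,
             color = geography)
         ) +
  geom_segment(aes(xend = 0,
                   yend = geography)
               ) +
  geom_point() +
  labs(subtitle = "Geography",
       x = "Difference in Price ($)\nOrganic - Conventional",
       y = "") +
  theme_minimal() +
  theme(legend.position = "none") +
  # Default points and segments use the color aesthetic, not fill
  scale_color_brewer(palette = "Dark2")
library(forcats)
Refer to the forcats cheatsheet
Common tasks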
Turn a character or numeric variable into a factor
Make a factor by discretizing / “binning” a numeric variable
Rename or reorder the levels of an existing factor
Note
The package forcats (“for categoricals”) gives nice shortcuts for wrangling categorical variables.
forcats loads with the tidyverse!
factor
[1] "apple" "dog" "banana" "cat"
[5] "banana" "Queen Elizabeth" "dog"
fct_recode()
new level = old level
[1] fruit pet fruit pet
[5] fruit Queen Elizabeth pet
Levels: fruit pet Queen Elizabeth
Note
Notice Queen Elizabeth is a “remaining” level that was never recoded.
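A minimal sketch of a recode that would produce values like those printed above. The object name `x` is an assumption; the slide does not show what the vector was called.

```r
library(forcats)

# Hypothetical object name; the values match the vector printed above
x <- factor(c("apple", "dog", "banana", "cat",
              "banana", "Queen Elizabeth", "dog"))

# Syntax: "new level" = "old level"; several old levels can share one new level
x_recoded <- fct_recode(x,
                        "fruit" = "apple",
                        "fruit" = "banana",
                        "pet"   = "dog",
                        "pet"   = "cat")

as.character(x_recoded)
levels(x_recoded)  # "Queen Elizabeth" was never recoded, so it stays a level
```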
fct_relevel()
tidyverse
'data.frame': 77 obs. of 16 variables:
$ name : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
$ manuf : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
$ type : Factor w/ 2 levels "cold","hot": 1 1 1 1 1 1 1 1 1 1 ...
$ calories: int 70 120 70 50 110 110 110 130 90 90 ...
$ protein : int 4 3 4 4 2 2 2 3 2 3 ...
$ fat : int 1 5 1 0 2 2 0 2 1 0 ...
$ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
$ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
$ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
$ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
$ potass : int 280 135 320 330 -1 70 30 100 125 190 ...
$ vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
$ shelf : int 3 3 3 3 3 1 2 3 1 3 ...
$ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
$ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
$ rating : num 68.4 34 59.4 93.7 34.4 ...
cereal_casewhen <- cereal |>
mutate(manuf = case_when(manuf == "A" ~ "American Home Food Products",
manuf == "G" ~ "General Mills",
manuf == "K" ~ "Kelloggs",
manuf == "N" ~ "Nabisco",
manuf == "P" ~ "Post",
manuf == "Q" ~ "Quaker Oats",
manuf == "R" ~ "Ralston Purina"
),
manuf = as.factor(manuf)
)
summary(cereal_casewhen$manuf)
American Home Food Products General Mills
1 22
Kelloggs Nabisco
23 6
Post Quaker Oats
9 8
Ralston Purina
8
cereal_recode <- cereal |>
mutate(manuf = fct_recode(manuf,
"American Home Food Products" = "A",
"General Mills" = "G",
"Kelloggs" = "K",
"Nabisco" = "N",
"Post" = "P",
"Quaker Oats" = "Q",
"Ralston Purina" = "R"
)
)
summary(cereal_recode$manuf)
American Home Food Products General Mills
1 22
Kelloggs Nabisco
23 6
Post Quaker Oats
9 8
Ralston Purina
8
ggplot2
Disclaimer: fix your axes and legend labels!
We will be working with the survey.csv data from Lab 2: Exploring Rodents with ggplot2 to improve our plots!
See Will Chase’s 2020 RStudio Conference Presentation - Glamour of Graphics
You will be asked to “sketch your game plan” with https://excalidraw.com/.
Danger
You will be required to use functions from the {forcats} package! For example, reorder() is a no go; use fct_reorder() instead!
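For instance, a sketch of reordering factor levels with fct_reorder() rather than reorder(). The values are copied from the price-difference summary table shown earlier; sorting in descending order is just one choice made for illustration.

```r
library(dplyr)
library(forcats)
library(tibble)

# Values copied from the summary table shown earlier
diff_summary <- tibble(
  geography = factor(c("San Francisco", "San Diego", "Sacramento", "Los Angeles")),
  mean_price_diff = c(0.719, 0.685, 0.578, 0.528)
)

# Reorder the levels of geography by the mean difference, largest first
diff_ordered <- diff_summary |>
  mutate(geography = fct_reorder(geography, mean_price_diff, .desc = TRUE))

levels(diff_ordered$geography)
```

Because ggplot2 draws discrete axes in level order, reordering the levels is what changes the order of the categories in a plot.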
library(lubridate)
Common Tasks
Convert a date-like variable (“May 8, 1995”) to a special DateTime Object.
Find the weekday, month, year, etc from a DateTime object
Convert between timezones
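A short sketch of these three tasks. The object names are made up for illustration, and parsing the month name assumes an English locale.

```r
library(lubridate)

# Convert a date-like string to a Date object
bday <- mdy("May 8, 1995")

# Find the year, month, and weekday
year(bday)                              # 1995
month(bday)                             # 5
wday(bday, label = TRUE, abbr = FALSE)  # weekday as an ordered factor

# Convert between time zones: same instant, different clock time
noon_utc <- ymd_hms("1995-05-08 12:00:00", tz = "UTC")
noon_la  <- with_tz(noon_utc, tzone = "America/Los_Angeles")
hour(noon_la)
```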
datetime Objects
There are actually three data types (classes) in R for dates and datetimes.
Date (duh)
POSIXlt (???)
and POSIXct (???)
POSIXlt and POSIXct
POSIXct – stores date/time values as the number of seconds since January 1, 1970 (“Unix Epoch”)
POSIXlt – stores date/time values as a list with elements for second, minute, hour, day, month, and year, among others.
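The difference between the two classes can be seen with base R alone. This is a small sketch; the object names are arbitrary.

```r
# Base R only; no packages needed
x_ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
x_lt <- as.POSIXlt(x_ct, tz = "UTC")

# POSIXct: a single number, seconds since 1970-01-01 00:00:00 UTC
as.numeric(x_ct)   # 86400, i.e. one day of seconds

# POSIXlt: a list of components that can be pulled out by name
x_lt$mday          # day of the month
x_lt$hour          # hour of the day
x_lt$year + 1900   # years are stored relative to 1900
x_lt$mon + 1       # months are stored as 0-11
```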
lubridate!
What is wrong with these two code chunks?
next birthday…
[1] Monday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
hundredth…
One of the most famous mysteries in California history is the identity of the so-called “Zodiac Killer”, who murdered 7 people in Northern California between 1968 and 1969. A new murder was committed last year in California, suspected to be the work of a new Zodiac Killer on the loose.
Unfortunately, the date and time of the murder is not known. You have been hired to crack the case. Use the clues below to discover the murderer’s identity.
Submit the name of the killer to the Canvas Quiz.
Today we will…
stringr
A string is a bunch of characters.
Don’t confuse a string (many characters, one object) with a character vector (vector of strings).
stringr
Common tasks
Find which strings contain a particular pattern
Remove or replace a pattern
Edit a string (for example, make it lowercase)
Note
The package stringr is very useful for strings!
stringr loads with the tidyverse.
All the functions are named str_xxx().
pattern =
The pattern argument in all of the stringr functions …
Note
Discuss with a neighbor. For each of these functions, give:
str_detect()
Returns a logical vector (TRUE/FALSE) indicating whether the pattern was found in each element of the original vector.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE TRUE TRUE
Pairs well with filter(), and with summarise() together with sum() or mean().
Related functions
str_subset() returns just the strings that contain the match
str_which() returns the indexes of strings that have a match
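A sketch of str_detect() inside dplyr verbs, using the same strings as above. The `phrases` tibble is a hypothetical wrapper created only for this example.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical data frame built from the vector used above
phrases <- tibble(phrase = c("Hello,", "my name is", "Bond", "James Bond"))

# filter(): keep only the rows whose phrase contains "Bond"
bond_rows <- phrases |>
  filter(str_detect(phrase, pattern = "Bond"))

# summarise(): TRUE counts as 1, so sum() counts matches and mean() gives the proportion
bond_counts <- phrases |>
  summarise(n_bond    = sum(str_detect(phrase, pattern = "Bond")),
            prop_bond = mean(str_detect(phrase, pattern = "Bond")))

# Related helpers work on the plain vector
bond_strings <- str_subset(phrases$phrase, pattern = "Bond")  # the matching strings
bond_index   <- str_which(phrases$phrase, pattern = "Bond")   # their positions
```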
str_match()
Returns a character matrix with either NA or the pattern, depending on whether the pattern was found.
str_extract()
Returns a character vector with either NA or the pattern, depending on whether the pattern was found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA NA "Bond" "Bond"
Warning
str_extract() only returns the first pattern match; use str_extract_all() to return every pattern match.
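To see the difference, consider a string that contains the pattern more than once. The `agents` vector is made up for illustration.

```r
library(stringr)

# Hypothetical strings: the first contains "Bond" twice, the second not at all
agents <- c("Bond, James Bond", "Moneypenny")

first_match <- str_extract(agents, pattern = "Bond")      # one value per string
all_matches <- str_extract_all(agents, pattern = "Bond")  # a list, one element per string

first_match
all_matches
```

Because str_extract_all() returns a list, it usually needs an extra step (for example lengths() or unnesting) before it fits in a data frame column.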
str_locate()
Returns an integer matrix with two columns, start and end, giving either NA or the start and end position of the pattern in each string.
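A brief sketch of str_locate() on the same vector used in the other examples, with the positions then passed to str_sub() to recover the match.

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# One row per input string; columns named "start" and "end"
locs <- str_locate(my_vector, pattern = "Bond")
locs

# The positions can be handed to str_sub() to pull the match back out
pulled <- str_sub(my_vector, start = locs[, "start"], end = locs[, "end"])
pulled
```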
str_subset()
Returns a character vector containing only the elements of the original character vector where the pattern occurs.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond" "James Bond"
Related Functions
str_sub() extracts values based on location.
str_replace(x, pattern = "", replacement = "")
Replaces the first matched pattern.
Pairs well with mutate().
Related functions
str_replace_all() replaces all matched patterns
str_remove_all() removes all matched patterns
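A small sketch comparing these functions. The string and the `agents` tibble are invented for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

x <- "Bond, James Bond"  # hypothetical example string

first_only <- str_replace(x, pattern = "Bond", replacement = "Smith")      # first match only
every_one  <- str_replace_all(x, pattern = "Bond", replacement = "Smith")  # all matches
removed    <- str_remove_all(x, pattern = "Bond")                          # all matches deleted

# Inside mutate(), the first argument is a column (a vector of strings)
agents <- tibble(name = c("James Bond", "Jane Bond")) |>
  mutate(short_name = str_replace(name, pattern = "Bond", replacement = "B."))
```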
Convert letters in the string to a specific capitalization format.
str_to_lower() converts all letters in the strings to lowercase.
str_to_upper() converts all letters in the strings to uppercase.
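A quick sketch of the case-conversion helpers on the vector used throughout. str_to_title() is a further stringr function not mentioned above.

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

lower <- str_to_lower(my_vector)  # every letter lowercase
upper <- str_to_upper(my_vector)  # every letter uppercase
title <- str_to_title(my_vector)  # first letter of each word capitalized
```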
str_c() joins multiple strings into a single string.
str_flatten() combines a vector of strings into a single string.
[1] "Hello, my name is Bond James Bond"
Note
str_c() with the collapse argument will do the same thing, but using str_flatten() is encouraged instead.
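A sketch of both approaches on the vector used in the earlier examples. The last line shows the other common use of str_c(): joining separate strings element by element.

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# Collapse one vector of strings into a single string
flat   <- str_flatten(my_vector, collapse = " ")
flat_c <- str_c(my_vector, collapse = " ")  # same result

# Join separate strings element by element
full_name <- str_c("James", "Bond", sep = " ")
```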
str_glue() uses the environment to create a string and evaluates {expressions}.
My name is Bond, James Bond
Tip
See the R package glue!
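A sketch of how output like the line above could be produced. The variable names `first` and `last` are assumptions; the slide does not show the original code.

```r
library(stringr)

# Hypothetical variables looked up from the current environment
first <- "James"
last  <- "Bond"

intro <- str_glue("My name is {last}, {first} {last}")
intro

# Anything inside {} is evaluated as R code
math <- str_glue("2 + 2 = {2 + 2}")
```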
Refer to the stringr cheatsheet
Remember that str_xxx functions need the first argument to be a vector of strings, not a data set.
Use them inside filter() or mutate().
name is_bran manuf type calories protein
1 100% Bran TRUE N cold 70 4
2 100% Natural Bran TRUE Q cold 120 3
3 All-Bran TRUE K cold 70 4
4 All-Bran with Extra Fiber TRUE K cold 50 4
5 Almond Delight FALSE R cold 110 2
6 Apple Cinnamon Cheerios FALSE G cold 110 2
7 Apple Jacks FALSE K cold 110 2
8 Basic 4 FALSE G cold 130 3
9 Bran Chex TRUE R cold 90 2
10 Bran Flakes TRUE P cold 90 3
11 Cap'n'Crunch FALSE Q cold 120 1
12 Cheerios FALSE G cold 110 6
13 Cinnamon Toast Crunch FALSE G cold 120 1
14 Clusters FALSE G cold 110 3
15 Cocoa Puffs FALSE G cold 110 1
16 Corn Chex FALSE R cold 110 2
17 Corn Flakes FALSE K cold 100 2
18 Corn Pops FALSE K cold 110 1
19 Count Chocula FALSE G cold 110 1
20 Cracklin' Oat Bran TRUE K cold 110 3
21 Cream of Wheat (Quick) FALSE N hot 100 3
22 Crispix FALSE K cold 110 2
23 Crispy Wheat & Raisins FALSE G cold 100 2
24 Double Chex FALSE R cold 100 2
25 Froot Loops FALSE K cold 110 2
26 Frosted Flakes FALSE K cold 110 1
27 Frosted Mini-Wheats FALSE K cold 100 3
28 Fruit & Fibre Dates; Walnuts; and Oats FALSE P cold 120 3
29 Fruitful Bran TRUE K cold 120 3
30 Fruity Pebbles FALSE P cold 110 1
31 Golden Crisp FALSE P cold 100 2
32 Golden Grahams FALSE G cold 110 1
33 Grape Nuts Flakes FALSE P cold 100 3
34 Grape-Nuts FALSE P cold 110 3
35 Great Grains Pecan FALSE P cold 120 3
36 Honey Graham Ohs FALSE Q cold 120 1
37 Honey Nut Cheerios FALSE G cold 110 3
38 Honey-comb FALSE P cold 110 1
39 Just Right Crunchy Nuggets FALSE K cold 110 2
40 Just Right Fruit & Nut FALSE K cold 140 3
41 Kix FALSE G cold 110 2
42 Life FALSE Q cold 100 4
43 Lucky Charms FALSE G cold 110 2
44 Maypo FALSE A hot 100 4
45 Muesli Raisins; Dates; & Almonds FALSE R cold 150 4
46 Muesli Raisins; Peaches; & Pecans FALSE R cold 150 4
47 Mueslix Crispy Blend FALSE K cold 160 3
48 Multi-Grain Cheerios FALSE G cold 100 2
49 Nut&Honey Crunch FALSE K cold 120 2
50 Nutri-Grain Almond-Raisin FALSE K cold 140 3
51 Nutri-grain Wheat FALSE K cold 90 3
52 Oatmeal Raisin Crisp FALSE G cold 130 3
53 Post Nat. Raisin Bran TRUE P cold 120 3
54 Product 19 FALSE K cold 100 3
55 Puffed Rice FALSE Q cold 50 1
56 Puffed Wheat FALSE Q cold 50 2
57 Quaker Oat Squares FALSE Q cold 100 4
58 Quaker Oatmeal FALSE Q hot 100 5
59 Raisin Bran TRUE K cold 120 3
60 Raisin Nut Bran TRUE G cold 100 3
61 Raisin Squares FALSE K cold 90 2
62 Rice Chex FALSE R cold 110 1
63 Rice Krispies FALSE K cold 110 2
64 Shredded Wheat FALSE N cold 80 2
65 Shredded Wheat 'n'Bran TRUE N cold 90 3
66 Shredded Wheat spoon size FALSE N cold 90 3
67 Smacks FALSE K cold 110 2
68 Special K FALSE K cold 110 6
69 Strawberry Fruit Wheats FALSE N cold 90 2
70 Total Corn Flakes FALSE G cold 110 2
71 Total Raisin Bran TRUE G cold 140 3
72 Total Whole Grain FALSE G cold 100 3
73 Triples FALSE G cold 110 2
74 Trix FALSE G cold 110 1
75 Wheat Chex FALSE R cold 100 3
76 Wheaties FALSE G cold 100 3
77 Wheaties Honey Gold FALSE G cold 110 2
fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
1 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.40297
2 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.98368
3 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.42551
4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.70491
5 2 200 1.0 14.0 8 -1 25 3 1.00 0.75 34.38484
6 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.50954
7 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.17409
8 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.03856
9 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.12025
10 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.31381
11 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.04285
12 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.76500
13 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.82357
14 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.40021
15 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.73645
16 0 280 0.0 22.0 3 25 25 1 1.00 1.00 41.44502
17 0 290 1.0 21.0 2 35 25 1 1.00 1.00 45.86332
18 0 90 1.0 13.0 12 20 25 2 1.00 1.00 35.78279
19 1 180 0.0 12.0 13 65 25 2 1.00 1.00 22.39651
20 3 140 4.0 10.0 7 160 25 3 1.00 0.50 40.44877
21 0 80 1.0 21.0 0 -1 0 2 1.00 1.00 64.53382
22 0 220 1.0 21.0 3 30 25 3 1.00 1.00 46.89564
23 1 140 2.0 11.0 10 120 25 3 1.00 0.75 36.17620
24 0 190 1.0 18.0 5 80 25 3 1.00 0.75 44.33086
25 1 125 1.0 11.0 13 30 25 2 1.00 1.00 32.20758
26 0 200 1.0 14.0 11 25 25 1 1.00 0.75 31.43597
27 0 0 3.0 14.0 7 100 25 2 1.00 0.80 58.34514
28 2 160 5.0 12.0 10 200 25 3 1.25 0.67 40.91705
29 0 240 5.0 14.0 12 190 25 3 1.33 0.67 41.01549
30 1 135 0.0 13.0 12 25 25 2 1.00 0.75 28.02576
31 0 45 0.0 11.0 15 40 25 1 1.00 0.88 35.25244
32 1 280 0.0 15.0 9 45 25 2 1.00 0.75 23.80404
33 1 140 3.0 15.0 5 85 25 3 1.00 0.88 52.07690
34 0 170 3.0 17.0 3 90 25 3 1.00 0.25 53.37101
35 3 75 3.0 13.0 4 100 25 3 1.00 0.33 45.81172
36 2 220 1.0 12.0 11 45 25 2 1.00 1.00 21.87129
37 1 250 1.5 11.5 10 90 25 1 1.00 0.75 31.07222
38 0 180 0.0 14.0 11 35 25 1 1.00 1.33 28.74241
39 1 170 1.0 17.0 6 60 100 3 1.00 1.00 36.52368
40 1 170 2.0 20.0 9 95 100 3 1.30 0.75 36.47151
41 1 260 0.0 21.0 3 40 25 2 1.00 1.50 39.24111
42 2 150 2.0 12.0 6 95 25 2 1.00 0.67 45.32807
43 1 180 0.0 12.0 12 55 25 2 1.00 1.00 26.73451
44 1 0 0.0 16.0 3 95 25 2 1.00 1.00 54.85092
45 3 95 3.0 16.0 11 170 25 3 1.00 1.00 37.13686
46 3 150 3.0 16.0 11 170 25 3 1.00 1.00 34.13976
47 2 150 3.0 17.0 13 160 25 3 1.50 0.67 30.31335
48 1 220 2.0 15.0 6 90 25 1 1.00 1.00 40.10596
49 1 190 0.0 15.0 9 40 25 2 1.00 0.67 29.92429
50 2 220 3.0 21.0 7 130 25 3 1.33 0.67 40.69232
51 0 170 3.0 18.0 2 90 25 3 1.00 1.00 59.64284
52 2 170 1.5 13.5 10 120 25 3 1.25 0.50 30.45084
53 1 200 6.0 11.0 14 260 25 3 1.33 0.67 37.84059
54 0 320 1.0 20.0 3 45 100 3 1.00 1.00 41.50354
55 0 0 0.0 13.0 0 15 0 3 0.50 1.00 60.75611
56 0 0 1.0 10.0 0 50 0 3 0.50 1.00 63.00565
57 1 135 2.0 14.0 6 110 25 3 1.00 0.50 49.51187
58 2 0 2.7 -1.0 -1 110 0 1 1.00 0.67 50.82839
59 1 210 5.0 14.0 12 240 25 2 1.33 0.75 39.25920
60 2 140 2.5 10.5 8 140 25 3 1.00 0.50 39.70340
61 0 0 2.0 15.0 6 110 25 3 1.00 0.50 55.33314
62 0 240 0.0 23.0 2 30 25 1 1.00 1.13 41.99893
63 0 290 0.0 22.0 3 35 25 1 1.00 1.00 40.56016
64 0 0 3.0 16.0 0 95 0 1 0.83 1.00 68.23588
65 0 0 4.0 19.0 0 140 0 1 1.00 0.67 74.47295
66 0 0 3.0 20.0 0 120 0 1 1.00 0.67 72.80179
67 1 70 1.0 9.0 15 40 25 2 1.00 0.75 31.23005
68 0 230 1.0 16.0 3 55 25 1 1.00 1.00 53.13132
69 0 15 3.0 15.0 5 90 25 2 1.00 1.00 59.36399
70 1 200 0.0 21.0 3 35 100 3 1.00 1.00 38.83975
71 1 190 4.0 15.0 14 230 100 3 1.50 1.00 28.59278
72 1 200 3.0 16.0 3 110 100 3 1.00 1.00 46.65884
73 1 250 0.0 21.0 3 60 25 3 1.00 0.75 39.10617
74 1 140 0.0 13.0 12 25 25 2 1.00 1.00 27.75330
75 1 230 3.0 17.0 3 115 25 1 1.00 0.67 49.78744
76 1 200 3.0 17.0 3 110 25 1 1.00 1.00 51.59219
77 1 200 1.0 16.0 8 60 25 1 1.00 0.75 36.18756
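A column like is_bran in the printout above could be created along these lines. This is a sketch on four cereal names copied from the output; `cereal_small` is a stand-in, since loading the full cereal data is not shown here.

```r
library(dplyr)
library(stringr)
library(tibble)

# A few names copied from the printout above; stand-in for the full data set
cereal_small <- tibble(
  name = c("100% Bran", "Almond Delight", "Cracklin' Oat Bran", "Nutri-grain Wheat")
)

# mutate(): the first argument of str_detect() is the column, not the data set
cereal_flagged <- cereal_small |>
  mutate(is_bran = str_detect(name, pattern = "Bran"), .after = name)

# filter(): keep only the bran cereals
bran_only <- cereal_small |>
  filter(str_detect(name, pattern = "Bran"))
```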
“Regexps are a very terse language that allow you to describe patterns in strings.”
R for Data Science
R uses “extended” regular expressions, which are common.
pattern = "REGEX GOES HERE"
Web app to test R regular expressions
Tip
Regular expressions are a reason to use stringr!
You might encounter gsub(), grep(), etc. from Base R.
. ^ $ \ | * + ? { } [ ] ( )
[1] "She" "sells" "seashells" "by" "the" "seashore!"
. Represents any character
[1] "She" "sells" "seashells" "by" "the" "seashore!"
^ Looks at the beginning
$ Looks at the end
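A sketch of these three special characters on the vector printed above. The object name `tongue_twister` is an assumption.

```r
library(stringr)

# Hypothetical object name; values match the vector printed above
tongue_twister <- c("She", "sells", "seashells", "by", "the", "seashore!")

# . stands in for any single character
any_he <- str_subset(tongue_twister, pattern = "^.he")

# ^ anchors the pattern to the beginning of the string (case sensitive)
starts_s <- str_subset(tongue_twister, pattern = "^s")

# $ anchors the pattern to the end of the string
ends_s <- str_subset(tongue_twister, pattern = "s$")
```

Note that "seashore!" does not end in "s" because of the exclamation point, and "She" does not start with a lowercase "s".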
[1] "shes" "shels" "shells" "shellls" "shelllls"
? Occurs 0 or 1 times
+ Occurs 1 or more times
* Occurs 0 or more times
[1] "shes" "shels" "shells" "shellls" "shelllls"
{n} matches exactly n times.
{n,} matches at least n times.
{n,m} matches between n and m times.
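A sketch of the quantifiers applied to the vector printed above. The object name `shells` is an assumption. Each quantifier applies to the single character (here "l") immediately before it.

```r
library(stringr)

# Hypothetical object name; values match the vector printed above
shells <- c("shes", "shels", "shells", "shellls", "shelllls")

zero_or_one  <- str_subset(shells, "shel?s")      # 0 or 1 "l"
one_or_more  <- str_subset(shells, "shel+s")      # 1 or more "l"
zero_or_more <- str_subset(shells, "shel*s")      # 0 or more "l"
exactly_two  <- str_subset(shells, "shel{2}s")    # exactly 2 "l"
two_or_more  <- str_subset(shells, "shel{2,}s")   # at least 2 "l"
one_to_three <- str_subset(shells, "shel{1,3}s")  # between 1 and 3 "l"
```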
()
Groups can be created with ( )
| – “either” / “or”
tongue_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
tongue_twister2
[1] "Peter" "Piper" "picked" "a" "peck" "of" "pickled"
[8] "peppers!"
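A sketch of groups and alternation on the Peter Piper vector. The patterns are examples chosen for illustration, not the ones from the original slide.

```r
library(stringr)

tongue_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")

# The group limits what | applies to: "P" or "p", followed by "e", at the start
starts_pe <- str_subset(tongue_twister2, pattern = "^(P|p)e")

# "p", then "e" or "i", then "ck"
p_ck <- str_subset(tongue_twister2, pattern = "p(e|i)ck")

# Without parentheses, | splits the whole pattern: "pe" or "ick"
pe_or_ick <- str_subset(tongue_twister2, pattern = "pe|ick")
```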
[]
[] Character Classes
\w Looks for any “word” character (conversely, \W looks for any “not word” character)
\d Looks for any digit (conversely, \D looks for any non-digit)
\s Looks for any whitespace (conversely, \S looks for any non-whitespace)
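A sketch of character classes in use. The strings are made up for illustration. Inside an R string the backslash itself must be escaped, so \d is written "\\d".

```r
library(stringr)

# Hypothetical strings, invented for illustration
x <- c("agent 007", "Bond", "MI6", "top secret")

has_digit   <- str_detect(x, pattern = "\\d")       # any digit
digits      <- str_extract(x, pattern = "\\d+")     # first run of digits
has_space   <- str_detect(x, pattern = "\\s")       # any whitespace
no_nonword  <- str_remove_all(x, pattern = "\\W")   # drop everything that is not a "word" character

# Square brackets define a custom class
starts_cap  <- str_subset(x, pattern = "^[A-Z]")    # starts with an uppercase letter
first_vowel <- str_extract(x, pattern = "[aeiou]")  # first lowercase vowel
```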
Write regular expressions that search for words that do the following:
Test your answers out on
\
In order to match a special character, you need to “escape” it first.
Warning
In general, look at punctuation characters with suspicion.
[1] "How" "much" "wood" "could" "a" "woodchuck"
[7] "chuck" "if" "a" "woodchuck" "could" "chuck"
[13] "wood?"
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
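An error like the one above occurs when a bare ? is used as the pattern, since ? is a quantifier with nothing to quantify. A sketch of three ways to match a literal question mark follows; the object name `woodchuck` is an assumption.

```r
library(stringr)

# Hypothetical object name; values match the vector printed above
woodchuck <- c("How", "much", "wood", "could", "a", "woodchuck",
               "chuck", "if", "a", "woodchuck", "could", "chuck", "wood?")

# Escape the special character; the backslash is doubled inside an R string
escaped <- str_subset(woodchuck, pattern = "\\?")

# Put it inside a character class, where ? loses its special meaning
in_class <- str_subset(woodchuck, pattern = "[?]")

# Or tell stringr to treat the pattern as fixed text rather than a regex
as_fixed <- str_subset(woodchuck, pattern = fixed("?"))
```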
Read the regular expressions out loud like a “request”.
Test out your expressions on small examples first with str_view() and str_view_all().
I use the stringr cheatsheet more than any other package cheatsheet!
tidyverse
matches(pattern)
Selects all variables with a name that matches the supplied pattern.
Works with select(), rename_with(), and across().
I received this data from a grad school colleague the other day who asked if I knew how to “clean” it.
What is that column?! 😮
[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]
stringr! 🥳
Let’s see how this works.
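One possible approach, sketched on a shortened copy of the column value shown above (only the first two entries are kept). This is an illustrative assumption about how the cleaning could go, not necessarily the approach taken in class.

```r
library(stringr)
library(tibble)

# Shortened copy of the messy value shown above (first two entries only)
raw <- "[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}]"

# str_match_all() returns a list of matrices; column 2 holds the ( ) group
variant <- str_match_all(raw, pattern = "'variant': '([^']+)'")[[1]][, 2]

pct <- str_match_all(raw, pattern = "'newWeeklyPercentage': ([0-9.]+)")[[1]][, 2] |>
  as.numeric()

variants <- tibble(variant = variant, new_weekly_pct = pct)
variants
```

The pattern `[^']+` reads as "one or more characters that are not a single quote", which is what stops each match at the closing quote of the variant name.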
In this activity, you will be using regular expressions to decode a message.
Remember, the stringr functions go inside dplyr verbs like mutate() and filter(). Think of them as you would as.factor()
[1] "How" "much" "wood" "could" "a" "woodchuck"
[7] "chuck" "if" "a" "woodchuck" "could" "chuck"
[13] "wood?"